ABSTRACT

In this paper we provide a principled approach to solve a 
transductive classification problem involving a similar graph 
(edges tend to connect nodes with the same label) and a dissimilar
graph (edges tend to connect nodes with opposing labels). Most
existing methods, e.g., Information Regularization (IR) and the
Weighted vote Relational Neighbor classifier (WvRN), assume
that the given graph is only a similar
graph. We extend the IR and WvRN methods to deal with 
mixed graphs. We evaluate the proposed extensions on sev- 
eral benchmark datasets as well as two real world datasets 
and demonstrate the usefulness of our ideas. 

Categories and Subject Descriptors: I.5 [Pattern Recognition]:
Design Methodology - Classifier design and evaluation

General Terms: Algorithms, Experimentation 

Keywords: Classification, Graph based semi-supervised 
learning, Transductive learning, Mixed graphs 

1. INTRODUCTION 

Consider the problem of transductive classification in a 
relational graph consisting of labeled and unlabeled nodes. 
Most methods for this problem assume that connected nodes 
have the same labels. In many applications this assumption 
is violated to varying degrees depending on the underlying 
relational graph; that is, many edges can be formed using 
pairs of nodes having different class labels (this is referred
to as label dissimilarity). When this happens the perfor- 
mance of the methods can deteriorate significantly. If such 
'dissimilar' edges can be identified via domain knowledge or 
other ways, they can be eliminated to improve the perfor- 
mance. Even better, it makes sense to collect the identified 
dissimilar edges in a dissimilar graph and use it differently 
but together with the similar graph (set of edges connecting 
nodes having the same labels) to improve classification. This
paper builds on this idea. Let us refer to the combination
of similar and dissimilar graphs simply as a mixed graph.
Recently Goldberg et al. [3] extended the graph-based
semi-supervised learning method of Sindhwani et al. [7]
to deal with mixed graphs. In this method classification
is fundamentally based on content features of nodes, with
the mixed graph strongly guiding the classification. If f_i
and f_j denote the classifier outputs associated with nodes i
and j that form a dissimilar edge, Goldberg et al.'s method
[3] includes a loss term, (f_i + f_j)^2, in the training
objective function, thus putting pressure on f_i and f_j to have
opposing signs. In many applications, content features are
either weak or unavailable. Such problems have to be ad- 
dressed in a purely graph transductive setting. In another 
related work, Tong and Jin [10] proposed a graph based
approach using semi-definite programming (SDP) to explore
both similar and dissimilar graphs. The problem solved in 
their work is a non-convex programming problem whose so- 
lution can lead to local optima. In contrast the proposed 
methods in this paper are simpler and more efficient. Fur- 
ther our extension of the information regularization method 
for mixed graph (IR-MG) leads to a convex programming 
problem and the proposed algorithm converges to the global 
solution. 

The main aim of this paper is to extend and explore ex- 
isting methods for a transductive setting to deal with mixed 
graphs (even when non-content based relational graphs are 
available). We only take up binary classification in this 
paper. There are many worthy methods of this kind;
examples are: Information Regularization (IR) [1],
the Weighted vote Relational Neighbor classifier (WvRN)
[5], Local and Global Consistency (LGC) [11] and Gaussian
Function Harmonic Field (GFHF) [12]. To keep the paper
short we take only the first two methods for extension. Both
these methods are based on probabilistic ideas; thus, instead
of the squared loss used by Goldberg et al. [3], we devise a
divergence-based convex loss function to deal with dissimilar
edges. Empirical results show that the extensions are very
effective, although the ideas are simple and straightforward.

Depending on the way they are formed, the similarity and
dissimilarity graphs in a given problem may differ in purity.
So it is useful to have a hyperparameter (γ) that mixes
the effects of these two graphs (e.g., relative weighting
between the losses corresponding to the two graphs). We make
use of such a parameter; our experiments on the various
datasets point to the importance of this parameter. Though
Goldberg et al. [3] do not use such a parameter, it appears
to be useful for that method too. The quality of the graphs
with respect to the classification solution can also be
approximately measured using a quantity called the node
assortativity coefficient (NAC) [4]. NAC is easy to compute
and gives a good indication of the usefulness of the graphs
for classification. It can also be used to quickly select a
decent value for the γ parameter.

To demonstrate the effectiveness of our extended methods 
we do detailed experiments, like Goldberg et al. [3], on stan- 
dard academic benchmark datasets in which mixed graphs 
are constructed systematically but artificially. We also show 
usefulness of our methods on real world datasets involving 
web pages of shopping domains. In these problems mixed 
graphs arise naturally. For example, two web pages that 
either have strong structural similarity or have co-citation 
links from a common third page may have the same labels, 
and, web pages that have extremely poor structural corre- 
lation may have opposing labels. 

The paper is organized as follows. In section 2 we give the 
extensions of IR and WvRN for mixed graphs. In section 
3 we define NAC and discuss its usefulness; hyperparame- 
ter tuning is also discussed there. Experimental results are 
given in section 4 and we conclude with section 5. 

The following notation will be used in this paper. Let
G = (V, E, W) be an undirected graph with V =
{v_1, ..., v_n} representing the set of nodes, and E and W
representing the set of edges and the associated weights
respectively. Assume that w_ij ≥ 0, ∀i, j, where w_ij represents
the edge weight between the nodes v_i and v_j. In a graph G we
typically have both similar and dissimilar edges. Similar
edges connect nodes belonging to the same class and dissimilar
edges connect nodes belonging to different classes. Since an
edge can be either similar or dissimilar we can separate the
graph G into similar and dissimilar graphs (denoted as G_s
and G_d respectively). The nodes, edges and weights
corresponding to these graphs are defined accordingly as
G_s = (V_s, E_s, W_s) and G_d = (V_d, E_d, W_d). Let p_i and
q_i denote two probability distributions over the set of
possible labels, associated with the node v_i. Usually p_i
represents a known or prior distribution for node v_i and q_i
represents the probability distribution estimate obtained from
a given method. In this paper we are interested only in the
binary classification problem and so p_i and q_i are
2-dimensional vectors. Also, let P = [p_1, ..., p_n] and
Q = [q_1, ..., q_n]. Let L and UL denote the sets of labeled
and unlabeled nodes respectively.

2. PROPOSED METHODS 

In this section we show how two existing methods, namely, 
information regularization (IR) and Weighted vote Rela- 
tional Neighbor classification (WvRN) can be extended to 
handle the mixed graph scenario. 

2.1 Information Regularization in a mixed 
graph setting 

In the conventional setting only similar edges are assumed.
That is, we have G = G_s and the edge weights w_ij, ∀(i, j) ∈
E, in some sense indicate our belief or confidence in the
assumption that the connected nodes belong to the same class.
Under that assumption, we consider solving the transductive
classification problem by optimizing the objective function:

F(Q; P, W) = \sum_{i \in L} D(p_i \| q_i) + \lambda_s \sum_{(i,j) \in E} w_{ij} D(q_i \| q_j)    (1)

where D(·‖·) denotes any divergence measure that measures
the dissimilarity between two distributions. Several
divergence measures have been used in the literature. They
include the Kullback-Leibler (KL) divergence, the
Jensen-Shannon (JS) divergence, the Jensen-Renyi (JR)
divergence, etc. Here, we consider the Jensen-Shannon
divergence, which is a symmetric and smoothed version of the
KL divergence. When D is taken as the JS divergence the
regularization term is nothing but the information
regularization proposed by Corduneanu and Jaakkola [1] in a
graph setting. The first term in (1) is a data fitting term
and measures how well the estimate q_i matches the input
distribution p_i, ∀i ∈ L. The second term is a regularization
term and it regularizes the solution Q* with respect to the
underlying relational graph. The regularization constant λ_s
trades off between the data fitting and regularization terms.
When two nodes are strongly connected their distributions are
expected to be similar and the regularization term enforces
this behavior. Clearly, if the individual terms are convex
then the solution is unique.

Equation (1) assumes that all the edges are similarity edges
(i.e., E = E_s). Therefore the performance suffers to the
extent that this assumption is violated. To address this
problem we propose the following modified objective function:

F(Q; P, W) = \sum_{i \in L} D(p_i \| q_i) + \lambda_s \sum_{(i,j) \in E_s} w_{ij} D(q_i \| q_j) + \lambda_d \sum_{(i,j) \in E_d} w_{ij} D(q_i \| H q_j)    (2)

where H = [0 1; 1 0] is a transformation matrix and λ_d is
another regularization constant. Let q̃_j = H q_j. Clearly
q̃_j = 1 − q_j is still a distribution, and the transformation
facilitates divergence measurement of the distributions q_i
with the distributions 1 − q_j for the edges in E_d. Here, 1
represents a vector of all ones. Therefore the dissimilar
edges will also help in reinforcing the class distributions in
a positive way.
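As a rough illustration of the mixed-graph objective, the following Python sketch evaluates the three terms with the Jensen-Shannon divergence for two-class distributions. The function names, the edge representation as (i, j, weight) triples and the toy values are assumptions made for this example only; the swap in `flip` plays the role of the transformation that maps a distribution q to 1 − q.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions (natural log)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def js(p, q):
    """Jensen-Shannon divergence: a symmetric, smoothed version of KL."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def flip(q):
    """Swap the two class probabilities, i.e. map q to 1 - q."""
    return [q[1], q[0]]

def mixed_objective(Q, P, labeled, sim_edges, dis_edges, lam_s, lam_d):
    """Data-fit term plus similar-edge and dissimilar-edge regularizers."""
    fit = sum(js(P[i], Q[i]) for i in labeled)
    sim = sum(w * js(Q[i], Q[j]) for i, j, w in sim_edges)
    dis = sum(w * js(Q[i], flip(Q[j])) for i, j, w in dis_edges)
    return fit + lam_s * sim + lam_d * dis

# Two nodes with opposing beliefs (hypothetical values).
Q = {0: [0.9, 0.1], 1: [0.1, 0.9]}
as_dissimilar = mixed_objective(Q, {}, [], [], [(0, 1, 1.0)], 1.0, 1.0)
as_similar = mixed_objective(Q, {}, [], [(0, 1, 1.0)], [], 1.0, 1.0)
```

In this toy case the edge incurs no penalty when it is treated as dissimilar, because the two nodes already hold opposing distributions, while the same edge treated as similar is penalized.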

Corduneanu and Jaakkola [1] proved that the solution to
(1) with information regularization is unique. Using the
constraint q_i = p_i, ∀i ∈ L, they suggested a distributed
propagation algorithm that finds the solution in an iterative
fashion. In a similar way one can show that (2) is also convex
and that the solution can be found in an iterative fashion.
The proof is based on a standard log sum inequality and
properties of the KL-divergence measure [2]. Therefore (2) is
a natural extension of the information regularization approach
to the mixed graph setting; we will refer to this method as
information regularization for mixed graphs (IR-MG). The
algorithm is given in Algorithm 2.1.

We note that when we set q_i = p_i, ∀i ∈ L, and optimize
q_i only for i ∈ UL, the solution Q_UL depends only
on the second term in (1), and on the second and third terms
in (2). Such a setting is useful when the labels are clean and
the graph is not extremely dense in some regions [9]. Both
these requirements can often be met in many practical
applications. When they cannot be met, methods proposed
in [9] are useful to solve for Q. Such methods can be
appropriately extended to find the solution for our problem of
mixed graphs.

In a normalized graph setting, one way to normalize is
to do node level normalization using the node degree,
separately in each graph. That is, set W_s = D_s^{-1} W_s and
W_d = D_d^{-1} W_d, where [D_s]_ii = Σ_{j|(i,j)∈E_s} w_ij and
[D_d]_ii = Σ_{j|(i,j)∈E_d} w_ij. Then, we can set λ_s = λγ and
λ_d = λ(1 − γ) where λ > 0 and 0 ≤ γ ≤ 1. In such a case, λ is
the overall regularization constant and γ weighs the similar
and dissimilar contributions. When we set q_i = p_i, ∀i ∈ L,
we have only one parameter, γ. In practice, since the graphs
are impure to varying degrees (i.e., it may not be possible to
construct pure similar and dissimilar graphs), the γ parameter
plays an important role in achieving improved performance.

Algorithm 2.1 IR-MG Algorithm
t ← 0 and ε = 0.001
For all nodes i ∈ UL, initialize q_i^(t) to the class prior
(obtained from known labeled nodes) and fix q_i = p_i ∀i ∈ L.
repeat
    for each edge (i, j) ∈ E_s do
        u_{i,j} ← 0.5 (q_i^(t) + q_j^(t))
    end for
    for each edge (i, j) ∈ E_d do
        z_{i,j} ← 0.5 (q_i^(t) + 1 − q_j^(t))
    end for
    for each element i ∈ UL do
        q_i^(t+1) ← (1/ψ) exp(γ Σ_{j|(i,j)∈E_s} w_{i,j} log(u_{i,j}) + (1 − γ) Σ_{j|(i,j)∈E_d} w_{i,j} log(z_{i,j}))
    end for
    t ← t + 1
until max_{i∈UL, k=1,2} |q_{i,k}^(t) − q_{i,k}^(t−1)| < ε
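The IR-MG iteration can be sketched in plain Python as below. This is a minimal reading of the update rule under stated assumptions, not the authors' implementation: nodes are numbered 0..n-1, edges are (i, j, weight) triples whose weights are taken as already normalized, the per-node normalization is done by dividing by the sum over the two classes, and the `tiny` floor that guards against log(0) is an implementation detail introduced here.

```python
import math

def ir_mg(n, labeled, sim_edges, dis_edges, gamma, eps=1e-3, max_iter=1000):
    """Iterate the IR-MG update; labeled maps node -> [p_1, p_2]."""
    prior = sum(p[0] for p in labeled.values()) / len(labeled)
    q = [list(labeled[i]) if i in labeled else [prior, 1.0 - prior]
         for i in range(n)]
    sim = [[] for _ in range(n)]
    dis = [[] for _ in range(n)]
    for i, j, w in sim_edges:
        sim[i].append((j, w))
        sim[j].append((i, w))
    for i, j, w in dis_edges:
        dis[i].append((j, w))
        dis[j].append((i, w))
    tiny = 1e-12  # assumption: floor to avoid log(0)
    for _ in range(max_iter):
        new_q = [row[:] for row in q]
        for i in range(n):
            if i in labeled or not (sim[i] or dis[i]):
                continue  # labeled nodes stay fixed; isolated nodes keep the prior
            logs = []
            for k in (0, 1):
                s = 0.0
                for j, w in sim[i]:
                    u = 0.5 * (q[i][k] + q[j][k])
                    s += gamma * w * math.log(max(u, tiny))
                for j, w in dis[i]:
                    z = 0.5 * (q[i][k] + 1.0 - q[j][k])
                    s += (1.0 - gamma) * w * math.log(max(z, tiny))
                logs.append(s)
            top = max(logs)  # subtract the max before exp for stability
            e = [math.exp(v - top) for v in logs]
            new_q[i] = [e[0] / (e[0] + e[1]), e[1] / (e[0] + e[1])]
        delta = max(abs(new_q[i][k] - q[i][k])
                    for i in range(n) for k in (0, 1))
        q = new_q
        if delta < eps:
            break
    return q

# Toy chain: 0 (class 1) ~ 1, 1 !~ 2, 2 ~ 3 (class 2); "~" similar, "!~" dissimilar.
labeled = {0: [1.0, 0.0], 3: [0.0, 1.0]}
q = ir_mg(4, labeled, [(0, 1, 1.0), (2, 3, 1.0)], [(1, 2, 1.0)], gamma=0.5)
```

On this toy chain the similar edges pull each unlabeled node toward its labeled neighbor and the dissimilar edge pushes the two unlabeled nodes apart, so node 1 leans toward the first class and node 2 toward the second.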

2.2 WvRN Classification in a mixed graph set- 
ting 

The original probabilistic Weighted vote Relational Neigh- 
bor classifier (with relaxation labeling) method [5] was for- 
mulated to solve the collective classification problem (for 
only similar graphs) where class distributions of a subset of 
nodes are known and fixed. Then the class distributions of 
the remaining (unlabeled) nodes are obtained by an itera- 
tive algorithm. It has two components, namely, weighted 
vote relational neighbor classifier component and relaxation 
labeling (RL) component. The relaxation labeling component
performs collective inferencing and keeps track of the
current probability estimates q_i^(t) for all unlabeled nodes
at each time instant t. These frozen estimates q_i^(t) are
used by the relational classifier. The relational classifier
computes the probability distribution for each unlabeled node
as the weighted sum of the probability distributions of its
neighbors with weights w_ij; that is,

q_{i,k}^{(t+1)} = \frac{1}{\psi} \sum_{j | (i,j) \in E} w_{ij} q_{j,k}^{(t)}    (3)

where k = 1, 2 and ψ is a normalizing constant. Since
relaxation labeling may not converge, simulated annealing is
sometimes performed to ensure convergence [5].
In a mixed graph setting, we can modify (3) as:

q_{i,k}^{(t+1)} = \frac{1}{\psi} \Big( \gamma \sum_{j | (i,j) \in E_s} w_{ij} q_{j,k}^{(t)} + (1 - \gamma) \sum_{j | (i,j) \in E_d} w_{ij} (1 - q_{j,k}^{(t)}) \Big)    (4)

where k = 1, 2. As in the case of the IR-MG method, the
parameter γ weighs the similar and dissimilar graphs. With
the modification given in (4) we refer to this method as
WvRN-MG. The algorithm is given in Algorithm 2.2.

Algorithm 2.2 WvRN-MG Algorithm
t ← 0, β^(t) ← 1, ν ← 0.95 and ε = 0.001
For all nodes i ∈ UL, initialize q_i^(t) to the class prior
(obtained from known labeled nodes) and fix q_i = p_i ∀i ∈ L.
repeat
    for each element i ∈ UL and k = {1, 2} do
        q̂_{i,k} ← (1/ψ) (γ Σ_{j|(i,j)∈E_s} w_{i,j} q_{j,k}^(t) + (1 − γ) Σ_{j|(i,j)∈E_d} w_{i,j} (1 − q_{j,k}^(t)))
        q_{i,k}^(t+1) ← β^(t) q̂_{i,k} + (1 − β^(t)) q_{i,k}^(t)
    end for
    t ← t + 1 and β^(t+1) ← β^(t) * ν
until max_{i∈UL, k=1,2} |q_{i,k}^(t) − q_{i,k}^(t−1)| < ε
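A compact Python sketch of the WvRN-MG iteration follows. It is an illustration under assumptions rather than the reference implementation: nodes are numbered 0..n-1, edges are (i, j, weight) triples, and the annealed blend of the new vote with the previous estimate (controlled by the hypothetical arguments `beta` and `decay`) follows the usual relaxation-labeling form, which is assumed here.

```python
def wvrn_mg(n, labeled, sim_edges, dis_edges, gamma,
            beta=1.0, decay=0.95, eps=1e-3, max_iter=1000):
    """Weighted-vote update on a mixed graph; labeled maps node -> [p_1, p_2]."""
    prior = sum(p[0] for p in labeled.values()) / len(labeled)
    q = [list(labeled[i]) if i in labeled else [prior, 1.0 - prior]
         for i in range(n)]
    sim = [[] for _ in range(n)]
    dis = [[] for _ in range(n)]
    for i, j, w in sim_edges:
        sim[i].append((j, w))
        sim[j].append((i, w))
    for i, j, w in dis_edges:
        dis[i].append((j, w))
        dis[j].append((i, w))
    for _ in range(max_iter):
        new_q = [row[:] for row in q]
        for i in range(n):
            if i in labeled:
                continue  # labeled nodes stay fixed
            votes = [0.0, 0.0]
            for k in (0, 1):
                for j, w in sim[i]:
                    votes[k] += gamma * w * q[j][k]
                for j, w in dis[i]:
                    # a dissimilar neighbor votes for the opposite class
                    votes[k] += (1.0 - gamma) * w * (1.0 - q[j][k])
            total = votes[0] + votes[1]
            if total <= 0.0:
                continue  # no usable neighbors: keep the current estimate
            for k in (0, 1):
                # assumed annealing step: blend new vote with old estimate
                new_q[i][k] = beta * votes[k] / total + (1.0 - beta) * q[i][k]
        delta = max(abs(new_q[i][k] - q[i][k])
                    for i in range(n) for k in (0, 1))
        q = new_q
        beta *= decay
        if delta < eps:
            break
    return q

# Toy chain: 0 (class 1) ~ 1, 1 !~ 2, 2 ~ 3 (class 2).
labeled = {0: [1.0, 0.0], 3: [0.0, 1.0]}
q = wvrn_mg(4, labeled, [(0, 1, 1.0), (2, 3, 1.0)], [(1, 2, 1.0)], gamma=0.5)
```

Each pass only needs weighted sums over neighbors, with no logarithms or exponentials, which is consistent with the method being cheap per iteration.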

3. GRAPH CHARACTERISTICS AND SETTING γ

Characteristics of graphs play a major role in achieving 
good classification performance. One of the key characteris- 
tics of a relational graph is the correlation of the class vari- 
able of related entities. A graph is said to have homophily 
characteristics when the related entities in the graph have 
the same label; this was studied by early social network re- 
searchers. All the methods that make use of this assumption 
are essentially homophily based methods [5]. There is also 
a link-centric notion of homophily known as assortativity 
studied in [6]. The assortativity coefficient [6] measures the 
homophily characteristics based on the correlation between 
the classes linked by edges in the graph. Macskassy and
Provost developed a variant of this coefficient. It is based
on the graph's node assortativity matrix C where c_ij
represents, for all nodes of class y_i, the average weighted
fraction of their weighted edges that connect them to nodes of
class y_j, normalized such that Σ_{i,j} c_ij = 1. Then the
node assortativity coefficient (NAC) N is defined as:

N = \frac{\sum_i c_{ii} - \sum_i a_i b_i}{1 - \sum_i a_i b_i}

where a_i and b_i denote the sum of the i-th row and i-th
column of C respectively. This coefficient takes values in
[-1, 1] with the extremes indicating strong connectivity
between dissimilar and similar classes respectively. Macskassy
and Provost [5] studied the usefulness of this coefficient in
edge selection. Macskassy [4] used this coefficient to weigh
different edge types when there are multiple graphs.
Specifically, each edge was scaled by its graph's N value; if
the value was negative the scaling factor for that edge type
(graph) was set to zero. Since the original WvRN is a
homophily based method, Macskassy and Provost [5] set the
weight to zero for graphs having negative N values. We
illustrate below how this coefficient can be used to set γ in
the mixed graph scenario.
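The node assortativity coefficient can be estimated from the labeled nodes with a few lines of Python. The sketch below is one possible reading of the definition, stated as an assumption: per-node edge-weight fractions are averaged within each class, the resulting matrix is rescaled so its entries sum to one, and the standard assortativity ratio (trace minus the sum of row-column products, divided by one minus that sum) is returned. The function name and the convention of returning 0.0 when the ratio is undefined are choices made for this example.

```python
def node_assortativity(labels, edges, num_classes=2):
    """Estimate NAC from edges whose two endpoints both have known labels.

    labels: dict node -> class index; edges: (i, j, weight) triples.
    """
    per_node = {}  # node -> total edge weight going to each class
    for i, j, w in edges:
        if i in labels and j in labels:
            per_node.setdefault(i, [0.0] * num_classes)[labels[j]] += w
            per_node.setdefault(j, [0.0] * num_classes)[labels[i]] += w
    C = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for node, weights in per_node.items():
        total = sum(weights)
        if total <= 0.0:
            continue
        c = labels[node]
        counts[c] += 1
        for k in range(num_classes):
            C[c][k] += weights[k] / total
    for c in range(num_classes):
        if counts[c]:
            C[c] = [v / counts[c] for v in C[c]]  # average within the class
    grand = sum(sum(row) for row in C)
    if grand == 0.0:
        return 0.0  # assumed convention: no usable edges
    C = [[v / grand for v in row] for row in C]
    trace = sum(C[i][i] for i in range(num_classes))
    a = [sum(C[i]) for i in range(num_classes)]
    b = [sum(C[i][j] for i in range(num_classes)) for j in range(num_classes)]
    ab = sum(x * y for x, y in zip(a, b))
    if abs(1.0 - ab) < 1e-12:
        return 0.0  # assumed convention: ratio undefined
    return (trace - ab) / (1.0 - ab)

labels = {0: 0, 1: 0, 2: 1, 3: 1}
n_similar = node_assortativity(labels, [(0, 1, 1.0), (2, 3, 1.0)])
n_dissimilar = node_assortativity(labels, [(0, 2, 1.0), (1, 3, 1.0)])
```

The two toy graphs sit at the extremes: edges only within classes give the maximum value, and edges only across classes give the minimum.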

In our proposed methods, the mixture parameter γ plays
an important role since it decides the degree to which each
graph controls the performance. In practice this parameter
can be set in two ways. One way is to set γ using the
NAC values of the similar and dissimilar graphs. Let N_s and
N_d denote estimates of the NAC values of the similar and
dissimilar graphs respectively. Note that, if the dissimilar
graph is pure (for example, as in section 4.1 below) then
N_d = −1. Therefore, we can set γ = N_s / (N_s − N_d). If
N_s < 0 and/or N_d > 0 it is not a good idea to use the above
estimate of γ. For best performance it is a good idea to set γ
using cross-validation (CV). However, unlike NAC based γ
estimation, the CV technique is expensive since we need to run
the training algorithm several times. Finally, note that,
since both methods are based on labeled nodes, a good estimate
of γ can be obtained only when the number of labeled nodes is
not too small. In section 4.3 we illustrate the usefulness of
these techniques on several benchmark datasets.

4. EXPERIMENTS 

In this section we present results obtained from various ex- 
periments conducted on several academic benchmark datasets 
as well as real world datasets formed from web pages of shop- 
ping web sites. First we study the performances of the pro- 
posed methods, namely, IR-MG and WvRN-MG on mixed 
graphs constructed from already available relational graphs 
of benchmark datasets; these results demonstrate gains that 
accrue as a result of moving from a noisy similar graph 
towards a quite pure similar-dissimilar graph combination. 
Next, we evaluate the performances on similar and dissimi- 
lar graphs that arise naturally from web pages of shopping 
sites. Finally, we compare the relative performances of our 
methods as well as evaluate them against the method of 
Goldberg et al.[3]. 

4.1 Experiments on partitions of given graph 
into dissimilar and similar graphs 

Usually a given relational graph (G) with partially labeled
nodes is impure and consists of both similar and dissimilar
edges. For our experiments we extract similar and dissimilar
graphs (denoted as G_s and G_d) from G using the following
model. Similar to the work of Goldberg et al. [3] we use
an oracle which takes a pair of nodes and tells whether the
edge formed by them is similar or dissimilar. We construct
G_d by querying the oracle and randomly picking a percentage
(P) of the dissimilar edges connecting only unlabeled nodes
in G. Note that the learner only knows that the edges are
dissimilar; it does not know the actual labels of the nodes.
Thus, the dissimilar graph is a pure graph consisting of only
unlabeled nodes. Then the similar graph G_s is obtained as
G − G_d. Note that, unlike G_d, G_s may not be pure. This
is because we vary the percentage of edges picked from G to
construct G_d; also, even if we pick all the dissimilar edges
connecting unlabeled nodes, there can still be some dissimilar
edges connecting labeled and unlabeled nodes left in G_s.
This model is different from the model used by Goldberg et
al. [3]. In that work, the original graph G is taken as G_s,
and G_d is constructed by using the oracle to take random
pairs of nodes having opposing labels. Our model is
appropriate when we are given a graph and there is some way
of filtering out dissimilar edges from it. On the other hand,
the model used by Goldberg et al. [3] is appropriate when
we are given a similar graph and, additionally, one can
construct a dissimilar graph using domain knowledge. In both
models the dissimilar graph is pure; one can also think of
experimenting with alternate models which introduce some
noise in the dissimilar graph.
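The oracle-based partition described above can be mimicked as follows. This is a sketch under assumptions: the oracle is simulated by comparing hidden true labels, edges are (i, j, weight) tuples, and the names `split_graph`, `percent` and `seed` are hypothetical.

```python
import random

def split_graph(edges, true_labels, labeled, percent, seed=0):
    """Move a random percentage of the dissimilar edges that connect two
    unlabeled nodes into the dissimilar graph; everything else stays in
    the (possibly impure) similar graph."""
    rng = random.Random(seed)
    candidates = [
        (i, j, w) for i, j, w in edges
        if i not in labeled and j not in labeled
        and true_labels[i] != true_labels[j]  # simulated oracle query
    ]
    k = int(len(candidates) * percent / 100.0)
    dis_edges = rng.sample(candidates, k)
    picked = set(dis_edges)
    sim_edges = [e for e in edges if e not in picked]
    return sim_edges, dis_edges

true_labels = {0: 0, 1: 0, 2: 1, 3: 1}
labeled = {0}
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0)]
sim_all, dis_all = split_graph(edges, true_labels, labeled, percent=100)
sim_half, dis_half = split_graph(edges, true_labels, labeled, percent=50)
```

Even with all eligible edges picked, the edge between labeled node 0 and node 2 remains in the similar graph although its endpoints have different classes, which is the residual impurity noted in the text.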

A summary description of the various benchmark datasets
used in the experiments is given in Table 1. All the datasets
indicated correspond to binary classification problems. The
datasets G50C, WINDOWSMAC, WebKB-PAGELINK and
WebKB-LINK used in [7] are taken from
http://people.cs.uchicago.edu.
G50C is an artificial dataset generated from two unit
covariance normal distributions with equal probabilities; the
means are adjusted so that the true Bayes error is 5% [7].
The WINDOWSMAC dataset is a subset of the 20-newsgroup
dataset with the documents belonging to two categories,
windows and mac. The WebKB dataset arises from
hypertext-based categorization of web documents with two
classes, course and non-course. The WebKB-LINK dataset uses
features derived from the anchor text associated with links on
other web pages that point to a given web page. The
WebKB-PAGELINK dataset uses both PAGE and LINK features,
where PAGE features are derived from the content of a
page. In each of these four datasets, following [7, 3], we
construct the relational graph with k-nearest neighbors using
Gaussian weights. Specifically, the weight between kNN points
x_i and x_j is exp(−‖x_i − x_j‖² / (2σ²)), while other
weights are zero; k is set to 50, 10 and 200 for the G50C,
WINDOWSMAC and WebKB datasets respectively. We also
consider the datasets, CORAALL and IMDBALL that do 
not have any input feature representation. They have the 
relational graph matrix W constructed purely from underlying
relations. The CORAALL dataset is derived from the CORA
dataset, which comprises computer science research papers;
the relational graph is constructed using both co-citation
and common author relationships between papers. This dataset
has seven classes, with each class representing a topic such
as Neural Networks, Genetic Algorithms, etc. We converted
this seven class problem into 7 one-versus-all binary
classification problems and the corresponding datasets are
referred to as CORAALL1, CORAALL2 and so on, with the number
indicating the positive class. The IMDBALL dataset is based
on networked data from the Internet
Movie Database (IMDb) (http://www.imdb.com); here clas- 
sification is about predicting movie success determined by 
box-office receipts (high-revenue versus low-revenue) and the 
relational graph is constructed between movies by linking 
them when they share a production company. The weight 
of an edge in the resulting graph is the number of production 
companies two movies have in common [5] . The CORAALL 
and IMDBALL datasets are available with the toolkit de- 
scribed in [5]. 
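For the feature-based datasets, the k-nearest-neighbor graph with Gaussian weights mentioned above might be built as in the following sketch. Several choices here are assumptions: the bandwidth argument `sigma` has no value given in the text, an edge is kept when either endpoint lists the other among its k nearest neighbors, and the quadratic-time neighbor search is only meant for small examples.

```python
import math

def knn_gaussian_graph(points, k, sigma):
    """Return {(i, j): weight} with i < j for kNN pairs; other weights are zero."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n = len(points)
    weights = {}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sqdist(points[i], points[j]))
        for j in others[:k]:
            key = (min(i, j), max(i, j))  # undirected edge
            weights[key] = math.exp(-sqdist(points[i], points[j])
                                    / (2.0 * sigma ** 2))
    return weights

# Three hypothetical 1-D points; the far point still links to its nearest neighbor.
graph = knn_gaussian_graph([[0.0], [1.0], [10.0]], k=1, sigma=1.0)
```

The relation-only datasets (CORAALL, IMDBALL) skip this step because their weight matrix comes directly from co-citation, authorship or shared production companies.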

Next we give more details on the experiments. We provide 
plots only for a few datasets and comment on other datasets 
when needed. For each dataset, we varied the number of 
labeled nodes (L), the mixture parameter γ and the
percentage of dissimilar edges (P) in G used for forming the
dissimilar graph. In all our experiments we considered 25 
realizations where each realization corresponds to one ran- 
dom stratified labeling of nodes. 

We present various observations from the experimental
study conducted on all the academic benchmark datasets
given in Table 1. Compared to using the original graph G,
significant performance improvements were observed with
the use of the mixed graph, in a vast majority of cases of
varying P, γ and L on all the datasets. Performance results
on two representative datasets, viz. IMDBALL and
CORAALL1, are given in Figure 1. It is clearly seen that the
best performance is achieved for some intermediate values of
γ; see for instance the results of CORAALL1, IR-MG, L=80
and 200. This demonstrates that although the similar graph
is noisy, it is still useful in the mixed graph setting to get
improved performance. In the case of the IMDBALL dataset, the
best performance is achieved at low values of γ for smaller P
values; this is because the similar graph is more noisy (with
the original graph having a node assortativity coefficient of
only 0.36). However, for large P values, the similar graph
becomes purer (but still noisy) and the best performance is
again achieved for some intermediate values of γ.

Table 1: Properties of datasets: n and e denote the number of nodes and edges in G respectively; L, n_f, b and
N_s denote the number of labeled nodes, the number of (content) features, percentage of positive examples
and node assortativity coefficient values respectively.

We also conducted paired t-tests of statistical significance
to compare the IR-MG and WvRN-MG methods on each dataset.
On the original graph, the WvRN-MG method was slightly
better on the WebKB-PAGELINK, CORAALL1, CORAALL2,
CORAALL3 and CORAALL4 datasets, and the significance
reduces as the number of labeled nodes is increased. Next we
consider the mixed graph case. In the case of the CORAALL1
dataset, we observed that the IR-MG method started
performing better in an intermediate range of values of γ as
the graph becomes purer. At higher γ values (corresponding
to the original graph when P = 0 and subsequently a purer
similar graph as P increases), there was no statistically
significant difference. Similar observations were made in the
case of the IMDBALL dataset. Overall we found that the IR-MG
method performs better on purer graphs.

In practice we need automatic ways of using domain knowl- 
edge or otherwise to identify similar and dissimilar edges. 
This is an important research topic; but it is beyond the 
scope of this paper. In several applications similar and dis- 
similar graphs occur naturally, and both the graphs are typ- 
ically noisy. We demonstrate the usefulness of the proposed 
methods on one such application next. 

4.2 Evaluation on natural graphs from shop- 
ping sites 

We also evaluated the proposed methods on natural graphs
constructed using the structural signature (shingle) of web
pages from the shopping sites http://www.uncommongoods.com
(referred to as UG) and http://www.compusa.com (referred to
as CU). The similar and dissimilar graphs were constructed as
follows. A similar edge between two pages was formed when
their structural signatures had a match score of at least 6
(the values are in the range [0,8]), and a dissimilar edge was
put when the match score was 0.¹ In practice both the
dissimilar and similar graphs have noise since the signatures
are not accurate. We considered two binary classification
problems. In the first problem, the goal was to differentiate
product detail pages from the rest. In the second problem, the
intent was to distinguish product listing pages from others.
The properties of the datasets are given in Table 1.

Since the similar and dissimilar graphs are fixed we varied
only the number of labeled nodes (L = 40, 80, 160). We
evaluated the AUC performance of the IR-MG and WvRN-MG
methods on the similar graph (γ = 1) and dissimilar
graph (γ = 0) separately. Further, we evaluated the
performance on the mixed graph for the values of γ set by the
NAC and CV based estimation techniques. To study the quality
of these two estimation techniques, we also found the best
AUC score given by the optimal γ (searched over the grid of
γ values in the interval [0, 1] used in the cross-validation).
The average performance over 25 partitions for each of these
settings is presented in Figure 2. It is clearly seen that the
performance with the dissimilar graph is inferior compared
to the performance with the similar graph, particularly when
L is small. This correlates well with the NAC values given
in Table 1. Although the dissimilar graph is quite impure,
it is still useful. This is clearly seen in Figure 2 where the
performance with the mixed graph is better than the
performance with the similar and dissimilar graphs used alone;
see for instance the results for CU-Listing, WvRN, L=40. This
improvement is quite significant when L is small. Further,
the performance with the cross-validation choice of γ is very
close to the best performance and is only slightly inferior
when L = 40. The NAC based estimate of γ becomes useful
for sufficiently large values of L. The performance
difference between the IR-MG and WvRN-MG methods was
statistically significant at the level of 0.05 only on the
CU-Product and CU-Listing datasets when L = 40. We have
not reported the results for the UG-Product dataset since
the AUC scores were almost the same (around 0.99) for all the
graphs and methods.

¹ We used binary representation (i.e., edge with unit weight
or no edge) for the graphs since the signatures are not
accurate.

Figure 1: AUC score performance of the IR-MG and WvRN-MG methods on the IMDBALL and CORAALL1
datasets under two different label size conditions. The numbers in the legend (applicable for all plots) indicate
the percentage of dissimilar edges (with respect to the total number of dissimilar edges connecting unlabeled
nodes) in G_d. The dotted black line indicates the performance with the original graph G.

Figure 2: AUC score performance of the IR-MG and WvRN-MG methods on two shopping domain datasets
under three different label sizes (40, 80 and 160, indicated as 1, 2 and 3) for the dissimilar (dark blue), similar
(blue) and mixed graphs (3 cases, in this order: γ set by NAC (green), γ set by CV (orange) and the best γ
(maroon)).

Table 2: AUC performance comparison of Goldberg et al., IR-MG and WvRN-MG methods on various
datasets. The number of labeled examples (L) used in each dataset is indicated in parentheses. The number
of realizations in each case was 25. The γ values used in IR-MG and WvRN-MG are indicated in parentheses;
here, NAC and CV indicate the techniques that were used to set γ.

Dataset            Method           P = 5             P = 10            P = 20
G50C (50)          WvRN-MG (NAC)    0.9844 ± 0.0043   0.9886 ± 0.0040   0.9983 ± 0.0014
                   WvRN-MG (CV)     0.9916 ± 0.0042   0.9930 ± 0.0061   0.9970 ± 0.0043
                   IR-MG (NAC)      0.9851 ± 0.0042   0.9892 ± 0.0039   0.9986 ± 0.0012
                   IR-MG (CV)       0.9914 ± 0.0055   0.9938 ± 0.0059   0.9967 ± 0.0048
                   Goldberg et al.  0.9886 ± 0.0016   0.9946 ± 0.0011   0.9980 ± 0.0007
WINDOWSMAC (100)   WvRN-MG (NAC)    0.9632 ± 0.0056   0.9714 ± 0.0050   0.9927 ± 0.0026
                   WvRN-MG (CV)     0.9811 ± 0.0091   0.9887 ± 0.0084   0.9938 ± 0.0061
                   IR-MG (NAC)      0.9639 ± 0.0056   0.9722 ± 0.0050   0.9933 ± 0.0024
                   IR-MG (CV)       0.9815 ± 0.0090   0.9883 ± 0.0082   0.9940 ± 0.0015
                   Goldberg et al.  0.9714 ± 0.0029   0.9863 ± 0.0012   0.9950 ± 0.0003
WebKB-LINK (40)    WvRN-MG (NAC)    0.9465 ± 0.0120   0.9524 ± 0.0120   0.9696 ± 0.0074
                   WvRN-MG (CV)     0.9626 ± 0.0073   0.9723 ± 0.0059   0.9800 ± 0.0041
                   IR-MG (NAC)      0.9432 ± 0.0113   0.9499 ± 0.0118   0.9693 ± 0.0074
                   IR-MG (CV)       0.9614 ± 0.0077   0.9718 ± 0.0062   0.9801 ± 0.0042
                   Goldberg et al.  0.9451 ± 0.0260   0.9545 ± 0.0230   0.9607 ± 0.0201

4.3 Comparison with Goldberg et al.'s method 

Since Goldberg et al.'s method [3] depends on content fea- 
tures we restrict our comparison to the four datasets, G50C, 
WINDOWSMAC, WebKB-PAGELINK and WebKB-LINK. 
Goldberg et al. give two methods: one is based on regu- 
larized least squares (Lap-RLSC) and the other is based on 
SVMs (Lap-SVM) [7]. Both methods perform similarly. We
use Lap-RLSC for comparing against IR-MG and WvRN-MG.
For IR-MG and WvRN-MG we tuned γ using both
cross validation (CV) and NAC values; CV tuning is
obviously better and it is the one that should be used. The
results for the methods are given in Table 2 for various
values of P. Clearly all three methods give competitive
performance. The differences among the methods are
statistically significant for lower values of P. As in [3],
for Goldberg et al.'s method we did not
tune the hyperparameters for each choice of P. In the next 
section we show how tuning can be done and demonstrate 
its usefulness. In terms of computational speed Goldberg et 
al.'s method is comparable with IR-MG; WvRN-MG has an 
advantage over the other two methods because it is much 
faster (> 10 times) and also provides decent competitive 
performance. 

4.4 Setting up the γ parameter in Goldberg et al.'s
method

The above experiments clearly indicate the importance
of γ in the mixed graph to get improved performance. It
would be useful to introduce such a parameter in Goldberg
et al.'s method [3] also. One way of doing this is as follows.
In their method there is a graph regularization term f^T M f
which smooths the decision function. Here, f corresponds
to a vector of function values at the nodes of the graph
and the matrix M is a mixed graph analog of the graph
Laplacian L. The combinatorial graph Laplacian matrix L is
defined as L = D − W where D is the diagonal degree matrix
with D_ii = Σ_{j=1}^{n} W_ij, and the normalized version is
given as L_N = I − D^{-1/2} W D^{-1/2}. M is defined as
M = L + (1 − J) ∘ W where 1 is a matrix of all ones and ∘ is
the Hadamard (elementwise) product. J is an edge type matrix
with (i,j)-th element J_ij = 1 if there is a similarity edge
between nodes i and j, and J_ij = −1 if there is a
dissimilarity edge. To introduce γ we can modify M to be a
convex combination of matrices M_s and M_d corresponding to
the similar and dissimilar graphs; that is, we set
M = γ M_s + (1 − γ) M_d. Convex combinations of Laplacians
have been studied [8] in the context of multiview learning.
Here, M_s is nothing but the graph Laplacian L_s obtained
using W_s, and M_d = L_d + 2 W_d. To verify the usefulness of
this we conducted a simple experiment on the LINK dataset by
setting γ=0.7, P=1.0 and L=20. While the original method gave
an average AUC score of 0.93, the modified method gave a value
of 0.96. As before, γ can be tuned using cross-validation
along with the other hyperparameters.
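The convex combination of the similar-graph Laplacian and the dissimilar-graph term can be checked numerically with a small sketch. Everything below is illustrative: the dense list-of-lists matrices, the helper names and the toy values are assumptions. The check relies on the identity that the quadratic form of the combined matrix equals the weighted sum of w (f_i − f_j)² over similar edges, scaled by the mixing weight, plus w (f_i + f_j)² over dissimilar edges, scaled by one minus that weight.

```python
def laplacian(n, edges):
    """Combinatorial Laplacian D - W of an undirected weighted graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w
    return L

def adjacency(n, edges):
    """Symmetric weight matrix W."""
    W = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        W[i][j] += w
        W[j][i] += w
    return W

def mixed_matrix(n, sim_edges, dis_edges, gamma):
    """gamma * L_s + (1 - gamma) * (L_d + 2 * W_d)."""
    Ls = laplacian(n, sim_edges)
    Ld = laplacian(n, dis_edges)
    Wd = adjacency(n, dis_edges)
    return [[gamma * Ls[i][j] + (1.0 - gamma) * (Ld[i][j] + 2.0 * Wd[i][j])
             for j in range(n)] for i in range(n)]

def quad_form(M, f):
    """Compute f^T M f."""
    n = len(f)
    return sum(f[i] * M[i][j] * f[j] for i in range(n) for j in range(n))

# Toy example: one similar edge (0, 1) and one dissimilar edge (1, 2).
sim_edges = [(0, 1, 1.0)]
dis_edges = [(1, 2, 2.0)]
f = [1.0, -1.0, 0.5]
gamma = 0.7
penalty = quad_form(mixed_matrix(3, sim_edges, dis_edges, gamma), f)
edge_sum = (gamma * sum(w * (f[i] - f[j]) ** 2 for i, j, w in sim_edges)
            + (1.0 - gamma) * sum(w * (f[i] + f[j]) ** 2 for i, j, w in dis_edges))
```

The dissimilar-edge term is smallest when the two function values have equal magnitude and opposite sign, which is the pressure toward opposing signs described for Goldberg et al.'s loss.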

5. CONCLUSION 

In this paper we provided a principled approach to extend 
probabilistic scores based transductive classification meth- 
ods for mixed graphs. The proposed methods are simple 
and efficient. We highlighted the importance of hyperpa- 
rameter optimization and showed how this parameter can 
be optimized particularly when the number of labeled nodes 
is not too small. Experiments on several benchmark and real 
world datasets show the usefulness of the proposed methods. 